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Abstract 

We have developed PGPG (Pipeline Generator for Programmable GRAPE), a 
software which generates the low-level design of the pipeline processor and commu- 
nication software for FPGA-based computing engines (FBCEs). An FBCE typically 
consists of one or multiple FPGA (Field-Programmable Gate Array) chips and lo- 
cal memory. Here, the term 'Field-Programmable" means that one can rewrite the 
logic implemented to the chip after the hardware is completed, and therefore a single 
FBCE can be used for calculation of various functions, for example pipeline pro- 
cessors for gravity, SPH interaction, or image processing. The main problem with 
FBCEs is that the user need to develop the detailed hardware design for the pro- 
cessor to be implemented to FPGA chips. In addition, she or he has to write the 
control logic for the processor, communication and data conversion library on the 
host processor, and application program which uses the developed processor. These 
require detailed knowledge of hardware design, a hardware description language such 
as VHDL, the operating system and the application, and amount of human work is 
huge. A relatively simple design would require 1 person-year or more. The PGPG 
software generates all necessary design descriptions, except for the application soft- 
ware itself, from a high-level design description of the pipeline processor in the PGPG 
language. The PGPG language is a simple language, specialized to the description 
of pipeline processors. Thus, the design of pipeline processor in PGPG language is 
much easier than the traditional design. For real applications such as the pipeline 
for gravitational interaction, the pipeline processor generated by PGPG achieved the 



performance similar to that of hand-written code. In this paper we present a detailed 
description of PGPG version 1.0. 

Key words: methods: n-body simulations, methods: numerical 

1. Introduction 

Astronomical many-body simulations have been widely used to investigate the forma- 
tion and evolution of various astronomical systems, such as planetary systems, globular clusters, 
galaxies, clusters of galaxies, and large scale structures. In such simulations, we treat plan- 
etesimals, stars, or galaxies as particles interacting with each other. We numerically evaluate 
interactions between the particles and advance the particles according to Newton's equation of 
motion. 

In many cases, the size of an astrophysical many-body simulation is limited by the 
available computational resources. Simulation of pure gravitational many-body system is a 
typical example. Since the gravity is a long-range interaction, the calculation cost is 0(N 2 ) 
per timestep for the simplest scheme, where N is the number of particles in the system. We 
can reduce this 0(N 2 ) calculation cost to O(NlogN), by using some approximated algorithms, 
such as the Barnes-Hut treecode (Barnes, Hut 1986), but the scaling coefficient is pretty large. 
Thus, the calculation of the interaction between particles is usually the most expensive part 
of the entire calculation, and thus limits the number of particles we can handle. Smoothed 
Particle Hydrodynamics (SPH, Lucy 1977, Gingold, Monaghan 1977), in which particles are 
used to represent the fluid, is another example. In SPH calculations, hydrodynamical equation 
is expressed by short-range interaction between particles. The calculation cost of this SPH 
interaction is rather high, because the average number of particles which interact with one 
particle is fairly large, typically around 50, and the calculation of single pairwise interaction is 
quite a bit more complex compared to gravitational interaction. 

Astrophysics is not the only field where the particle-based calculation is used. Molecular 
dynamics (MD) simulation and boundary element method (BEM) are examples of numerical 
methods where each element of the system in principle interacts with all other elements in 
the system. In both cases, approaches similar to Barnes-Hut treecode or FMM (Greengard, 
Rokhlin 1987) help to reduce the calculation cost, but the interaction calculation dominates 
the total calculation cost. 

One extreme approach to accelerate the particle-based simulation is to build a special- 
purpose computer for the interaction calculation. Two characteristics of the interaction calcu- 
lation make it well suited for such approach. Firstly, the calculation of pairwise interaction is 
relatively simple. In the case of gravitational interaction, the total number of floating-point 
operations (counting all operations, including square root and divide operations) is only 20. So 
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it is not inconceivable to design a fully pipelined, hardwired processor dedicated to the calcu- 
lation of gravitational interaction. For other application like SPH or molecular dynamics, the 
interaction calculation is more complicated, but still hardware approach is feasible. Secondly, 
the interaction is in its simplest form all-to-all. In other words, each particle interacts with 
all other particles in the system. Thus, there is lots of parallelism available. In particular, it 
is possible to design a hardware so that it calculate the force from one particle to many other 
particles in parallel. In this way we can reduce the required memory bandwidth. Of course, 
if the interaction is of short-range nature, one need to implement some clever way to reduce 
calculation cost from 0(N 2 ) to O(N), and the reduction in the memory bandwidth is not as 
effective as in the case of true 0(N 2 ) calculation. 

The approach to develop specialized hardware for gravitational interaction, materialized 
in the GRAPE ("GRAvity piPE") project (Sugimoto et al. 1990; Makino and Taiji 1998), has 
been fairly successful, achieving the speed comparable or faster than the fastest general-purpose 
computers for the price tag one or two orders of magnitude smaller. For example, GRAPE- 
6, which costed 500M JYE, achieved the peak speed of 64 Tflops. This speed is favorably 
compared to the peak speed of the Earth Simulator (40Tflops) or ASCI-Q(30Tflops), both 
costed several tens of billions of JYE. A major limitation of GRAPE is that it cannot handle 
anything other than the interaction through 1/r potential. It is certainly possible to build a 
hardware that can handle arbitrary central force, so that molecular dynamics calculation can 
also be handled (Ito et al. 1993; Fukushige et al 1996; Narumi et al. 1999; Taiji et al 2003). 

However, to design a hardware that can calculate both the gravitational interaction and, 
for example, an SPH interaction is quite difficult. Actually, to develop the pipeline processor 
just for SPH interaction turned out to be a rather difficult task (Yokono et al. 1999). This is 
provably because the SPH interaction is much more complex than gravity. 

Computing devices which uses FPGA (Field-Programmable Gate Array) chips could 
offer the level of flexibility that was impossible to achieve with the conventional GRAPE ap- 
proaches. As its name suggests, FPGA is a mass-produced LSI chip, consisting of a large 
number of logic elements and switching network. By programming these logic elements and 
switching network, we can implement an arbitrary logic design, as far as it can fit to the chip 
used. Thus, a single hardware can be used to implement various pipeline processors, such as 
that for gravity, SPH, and others. Such FPGA-based "reconfigurable" computing device has 
been an active area of research since Splash-1 and Splash-2 (Buell et al. 1996), and several 
groups, including ourselves, have tried to apply the idea of reconfigurable computing to particle 
simulations (Kim et al. 1995, Hamada et al. 2000, Spurzem et al. 2002). Hamada et al. (2000) 
called this approach "Programmable GRAPE" or PROGRAPE. 

Figure 1 shows the basic structure of a PROGRAPE system. It consists of a pro- 
grammable GRAPE hardware and a host computer. The programmable GRAPE hardware 
typically is composed of FPGA chips to which the interaction pipelines are implemented, a 



3 



Host 
Computer 



Position etc. 

► 

< 

Force etc. 





Interface 






Unit 






& 






Particle 






Memory 





Interaction 
Pipeline 
(on FPGA) 



Programmable GRAPE 



Fig. 1. Basic structure of a programmable GRAPE(PROGRAPE). 

particle memory, and an interface unit, and calculates the interaction fj between z-the particle 
and other particles expressed as 



2jG(ai,a,-), 



(1) 



where a^ is the physical value of the i-th particle, such as position and velocities, and G() is 
a user-specified function. We specify the function GQ by programming FPGA. The physical 
values dLj of all particles are stored in the particle memory and supplies them to the interaction 
pipeline. The physical values a« are stored in registers of the interaction pipeline. The inter- 
face unit controls communications between the programmable GRAPE hardware and the host 
computer. The host computer performs all other calculations. 

FPGA-based PROGRAPEs have several important advantages over conventional full- 
custom GRAPE processors. One is that the development cost of the chip itself is paid by the 
manufacturer of the chip, not by us. Thus, initial cost is much lower. This low development cost 
means that new hardwares can be developed in shorter cycle. Large GRAPE hardwares took 
several years to develop, and this means the device technology used in GRAPE hardwares, even 
at the time of its completion, is a few years old. This delay implies quite a large performance 
hit. 

Thus, even though the efficiency in the transistor usage is much worse than full-custom 
GRAPE processors, the actual price-performance of a PROGRAPE system is not so bad, if 
one condition is satisfied: If the design of the pipeline processor to be implemented in FPGA 
and other necessary softwares can be developed sufficiently fast. Previous experiences tell us 
that it is not the case. To implement a relatively simple pipeline for gravitational interaction 
calculation took more than one person-year, and implementation of even a simple SPH pipeline 
would take much more. Thus, clearly the difficulty of the software development has been the 
limiting factor for the practical use of PROGRAPE or other FPGA-based computing device. 

The difficulty is partly because we have to design the interaction pipeline itself, for 
which we need rather detailed and lengthy description of hardware logic in hardware-description 
languages such as VHDL. In addition to the pipeline itself, we also need to develop the control 
logic for the pipeline and communication to the host, driver software on the host computer, 
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and software emulator library used to verify the design (see section 2). 

In theory, most of the design description of softwares and hardwares, including the bit- 
level design of the interaction pipeline itself, can be automatically generated from some high- 
level description of the pipeline itself. The basic idea behind the PGPG (Pipeline Generator for 
Programmable GRAPE) system, which we describe in this paper, is to realize such automatic 
generation. PGPG generates all necessary hardware design descriptions and driver softwares, 
from high-level description of the pipeline processor itself. Thus, the user is relieved of the bur- 
den of learning complex VHDL language. Also, the driver software is automatically generated, 
so that the user can concentrate on writing the application program, not the low-level driver 
software for a specific hardware. Thus, we can dramatically reduce the amount of the work 
of the application programmer. More importantly, when a new hardware becomes ready, once 
the PGPG system is ported, all user applications developed on it works unchanged. The effort 
spent to design one application on one hardware will not be thrown away when new hardware 
becomes available. 

In this paper, we describe the PGPG system version 1.0. In section 2, we describe the 
traditional design flow and its problem. In section 3 we describe the basic concept and structure 
of PGPG. In section 4, we show a design of gravitational force pipeline as an example of pipeline 
generated by PGPG. Section 5 is for discussion. Table 1 is a glossary for abbreviations used in 
the paper. 

2. Traditional Design Flow for FPGA-based Computing Engines 

In the traditional design flow, we design the FPGA-based computing system in the 
following five steps. 

(A) Target Function Specification: 

We specify the target function, namely the function that the pipeline processor calculates. 
This includes the specification of the input data (number format and word length), the 
dataflow for the calculation of the function, input and output number format and word 
length for each arithmetic operation. 

(B) Bit-Level Software Emulator: 

We develop a software emulator which implements the target function defined in step (A) 
in software. Using this software emulator, we verify whether the designed hardware can 
actually calculate the target function with required accuracy. In this step, we also define 
the application program interface (API). 

(C) Hardware Design: 

In this step we actually write the source code which implements the pipeline processor in 
a hardware-description language (HDL) such as VHDL. In addition, we design the control 
logic and host interface logic also in some HDL. The HDL description is compiled to 
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Table 1. Abbreviation Glossary 



API Application Program Interface 

BEM Boundary Element Method 

CAE Computer Aided Engineering 

DFT Discrete Fourier Transform 

FBCE FPGA-Based Computing Engine 

FMM Fast Multipole Method 

FPGA Field Programmable Gate Array 

GRAPE GRAvity PipE 

HDL Hardware Description Language 

LPM Library of Parameterized Modules 

LSI Large Scale Integration 

MD Molecular Dynamics 

PGDL PGPG Description Language [defined in this paper] 

PGPG Pipeline Generator for Programmable GRAPE [defined in this paper] 

PROGRAPE PROgrammablc GRAPE 

SLDL System-Level Design Language 

SPH Smoothed Particle Hydrodynamics 

VHDL VHSIC Hardware Description Language 

VHSIC Very High Speed Integrated Circuit 



the configuration data for the FPGA chips by a design software, usually provided by the 
manufacturer of the FPGA device. 

(D) Interface Software: 

We develop the software on the host computer which takes care of the communication to 
the hardware and data format conversion between the floating-point data on the host and 
specialized data format used on the developed pipeline processor. The developed software 
should have the same API as that of the software emulator developed in step (B). 

(E) Finally, we can actually use the pipeline processor with real application program, by 
combining the hardware, hardware configuration data, interface software and application 
software. 

Figure 2 summarizes these steps. In these steps, we have to design, test, and debug a 
large amount of hardware logic and softwares. Of course, many of the softwares and hardware 
designs can be reused, when we develop different applications. For example, the design of the 
floating-point multiplier is rather generic, and can be used in almost any application. Also, it 
is possible to buy such design, or design software which generate floating-point arithmetic unit 
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Fig. 2. Traditional design flow for FPGA-based computing system. Colored items mean hardware logics 
and softwares that we have to develop. 

with arbitrary word length, from some CAE company. However, just to understand how to use 
such a design, one need a deep understanding of the hardware and HDL used for that particular 
design. Thus, even though the reusability significantly reduce the amount of the work needed 
for the second and later design for one person, the initial hurdle remains rather high, for an 
astrophysicist who never used such software, or actually the availability of the library make the 
hurdle even higher, since a starter need to understand, in addition to the basics of the hardware 
design and HDL, the use of such libraries and particular design software for that library. 

The development of the interface software is generally even more difficult than the design 
of the hardware, since it requires the knowledge of how the device driver softwares work in the 
operating system of the host computer, and infinite number of small details like how to integrate 
the device driver to the operating system, how to correctly generate the compiler flags to compile 
the device driver so that it works with the kernel installed on the host computer etc etc. 

All these works combined make it almost impossible for an astrophysicist to even think 
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Fig. 3. Design flow with PGPG. A colored magenta item means the only file that we have to describe, 
of implementing the pipeline processor on an FPGA-based computing engine. 

3. The PGPG system 

3.1. Basic Idea of PGPG 

If we inspect Figure 2 again, we can see the fact that all softwares and hardware descrip- 
tion is derived from the target function specification in step (A). Thus, it should be possible for 
a sufficiently smart software to generate all necessary softwares and hardware descriptions from 
the target function description written in some high-level language. The basic idea of PGPG 
is to develop such a smart software. 

Figure 3 shows how the design flow changes with PGPG. After we define the target 
function, we write it in the high-level specification language, the PGPG description language 
(PGDL). The PGPG software system takes this PGDL description of the pipeline processor 
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Fig. 4. Block diagram of the example (artificial) pipeline. 

as input, and generates all softwares and hardware descriptions. Thus, with PGPG, an astro- 
physicist do not have to write VHDL source code for the pipeline processor or C source codes 
for interface library. 

In the rest of this section we illustrate how a pipeline processor is specified in PGDL 
and how that description is translated to actual codes. 

3.2. Example Target 

We consider the following (artificial) example: 



,n) 



(2) 



This function is designed purely to show how the PGDL description and PGPG software work. 
Figure 4 shows the pipeline itself. "Particle" here is represented by a single scalar value a. The 
interaction between particles % and j is defined as the product OjOj, and we calculate sum over 
j to obtain the "force" on particle i. Here, we have the essential ingredients of the system: 
particles, their representation, functional form of interaction. 

For the particle data (and a,), we use a logarithmic format, with 17 bits in total (1 bit 
for sign, 1 bit for zero or not, 7 bits for the integer part of logarithm and 8 bits for fractional 
part). The base of the logarithm is 2. This logarithmic format has the advantage that the 
multiplication becomes addition, so we do not need a multiplier circuit whose size is 0(m 2 ), 
where m is the length of the mantissa. Of course, the addition in logarithm format is more 
complex than that in the floating-point format. Thus, the relative advantage of the log format 
is not very large. The "multiplier" logic itself is generated automatically by PGPG. 

In our example target function, we convert the output of multiplier to fixed-point format, 
so that we can accumulate it with high accuracy. This is done by a circuit provided by PGPG. 
Finally, converted result is accumulated by a usual fixed-point adder circuit. 

The particles with index j is stored in the memory, and new data is supplied at each 
clock cycle. The particle with index % is fixed during one calculation, and is stored in the 
register within the pipeline processor. 

Figure 5 shows the PGDL description of this target function. The first two lines define 
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#define ascale (pow(2 . ,20 . 0) ) 
#define fscale (1 . 0/ (ascale*ascale) ) 

/NVMP 1; 
/NPIPE 2; 

/JPSET iaj ,aj □ , log, 17 ,8 ,ascale ; 
/IPSET iai,ai[] , log, 17 , 8 , ascale ; 
/FOSET sfij.f [], fix, 64, fscale; 

pg_log_muldiv(MUL, iaj , iai ,aij , 17, 1) ; 
pg_conv_ltof (aij ,f ij ,17,8,64, 1) ; 
pg_f ix_accum(f ij , sf ij , 64, 64, 1) ; 



Fig. 5. An example of design entry file written in PGDL 
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Fig. 6. Block diagram of the special-purpose processor generated from a PGDL program 

formulae used for the data format conversion between the internal data format (logarithmic for 
Oj and fixed-point for /j). These are actually used in the next block, which defines the interface 
etc. The next block (lines starting with "/") defines the register and memory layout, which also 
determines API. The final part describes the target function itself. It has C-like appearance, 
but actually defines the hardware modules and their interconnection. In the next subsection 
we describe the PGPG language in more detail. 

3.3. The PGDL Language 

In this section, we give a minimal description of the PGDL language. A full description 
is available in http://progrape.jp. 
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3.3.1. PGDL Target Hardware Model 

Figure 6 gives the structure of the special-purpose processor generated from a PGDL 
"program". It consist of the control logic, I/O logic, program-specified registers and a memory 
unit, and the pipeline unit. Program-specified registers are either input registers, which we 
call z-particle registers, or registers which accumulate the calculated interaction, which we call 
force-accumulation registers. We call memory unit j'-particle memory. 

In figure 5, the pipeline processor is specified by list of modules (lines with pg_. . .). 
Registers and memories are specified by lines with /IPSET (z-particle register), /JPSET (j- 
particle memory), and /FOSET (force-accumulation register). 

This hardware model is general and flexible enough to express any special-purpose com- 
puter which calculates the function of the form of equation (1). We use the analogy of particles 
and forces, but the actual data in the i-particle register or j-particle memory need not repre- 
sent physical particles, and "force" can be something completely different. For example, this 
hardware model can be used to describe Discrete Fourier Transform (DFT) or other types of 
convolution operations. 

3.3.2. PGDL API Model 

Currently, one PGDL program generates one function prototype, void force (. . .), 
with list of arguments. The list of arguments consists of the data to be stored to i-particle 
registers, that to be stored to j-particle memories, and the data to be returned (the content 
of force accumulation registers). For the body of the function, both the emulator function and 
actual driver and data conversion function are generated, and the user can use either one by 
linking appropriate object file, without changing the source file. In figure 5, second arguments 
of /JPSET, /IPSET and /FOSET lines determine the name of the arguments which corresponds 
to the specified register or memory element. 

3.3.3. PGDL Program Structure 

A PGDL program consists of the following sections: 

1. macro declaration 

2. generic declaration 

3. interface declaration 

4. pipeline description 

The macro declaration, which is the first two lines of code in figure 5, is processed by C 
preprocessor (cpp) and used just for convenience to allow the same expressions which appear 
in multiple places to be defined only once. 

The generic declaration in figure 5 are two lines: 

/NVMP 1; 
/NPIPE 2; 
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The first line determines the degree of the virtual multiple pipeline (Makino et al. 1997). 
The second one is the number of physical pipelines implemented to the current design. Thus, we 
can change the physical number of pipelines by just change this parameter, and the application 
program can make use of the parallel pipeline without any need to change the user code. 

The interface declaration is the following part: 

/JPSET iaj ,aj [], log, 17,8, ascale; 
/IPSET iai , ai [] , log ,17,8, ascale ; 
/FOSET sf ij ,f [], fix, 64, f scale; 

The first argument is the name used for the registers and memories in the pipeline 
description, and the second one is the name used for API. the remaining arguments specifies 
the number formats. In this example, both and aj are in the logarithmic format, with 17 
bits of the total word length and 8 bits of mantissa. 

Finally, the pipeline description is the following part: 

pg_log_muldiv(MUL, iaj , iai , aij , 17, 1) ; 
pg_conv_ltof (aij ,fij ,17,8,64,1) ; 
pg_f ix_accum(f ij ,sf ij ,64,64, 1) ; 

Here, pg_log_muldiv generates one multiplier in the logarithmic format, which takes 
two inputs, iaj and iai, and calculates one output result, aij. The rest of arguments, 17 and 
1 indicate the bit length and number of pipeline stages, respectively. The inputs are taken from 
the j-particle memory and the 2-particle register with the corresponding names, and the output 
becomes the input to the next module pg_conv_ltof . This module converts the logarithmic 
format to fixed-point formant. Finally, module pg_f ix_accum accumulates the result, and the 
value of this accumulator, sf ij is accessible from the application program with name f, as 
specified in /FOSET declaration. 

3.3.4. PGDL Arithmetic Modules 

The present version of PGDL supports the following two number format: (a) fixed-point 
format and (b) logarithmic format. For the fixed-point format PGDL supports addition (and 
accumulation as well), subtraction, and conversion to the logarithmic format. For the logarith- 
mic format, multiplication, division, power functions (with rational powers), and conversion 
to fixed-point format are supported. Appendix 1 gives more detailed discussion of the PGDL 
language elements. 

4. A Real Example: Gravitational Force Pipeline 

In this section, we discuss the implementation of the gravitational force calculation in 
PGDL in detail. We chose the gravitational force calculation as the example target, because 
we can compare the performance and size of PGDL-generate design with hand-coded ones such 
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Fig. 7. Block diagram of the pipeline for gravitational force 



as the pipeline design of GRAPE-5. 

The pipeline to be designed calculates the gravitational force on particle v. 



m j r ij 



V(4+ £2 ) 3/2 (3) 

where aj is the gravitational acceleration of particle i, rj and m,j are the position and mass of 
particle i, r,y = r^- — r,j, and e is a softening parameter. Here we design the pipeline essentially 
the same as that of GRAPE-3 and GRAPE-5, to compare the performance and size. 

4-1. The PGDL Pipeline Design Description 

Figure 7 shows the block diagram of the gravitational force pipeline. Position data 
for both z-particle and j-particle are in the fixed-point format, while mj is in the logarithmic 
format. After subtraction, Xj — Xj, the results are converted to the logarithmic format, and all 
calculations until the final accumulation are done in this logarithmic format. 

Figure 8 shows the PGDL program file for the gravitational force pipeline in figure 7. 
One can see that each pg module in figure 8 directly corresponds to arithmetic units in figure 
7. Actually, the PGDL description is more compact, since it allows implicit array operation, as 
in the case of the first line: 

pg_f ix_addsub(SUB,xi,xj ,xij ,NP0S,1) ; 

Here, three modules are generated automatically, because both xi and xj are declared as the 
array of size 3 in the interface declaration. Also, there is no need to explicitly specify the 
wait (pipeline delay) modules, since the PGDL compiler inserts the necessary delay element 
automatically. 

The top-level interface function to the application program generated from this PGDL 

13 



#define xscale (pow(2 . , 32 . 0) /64 . 0) 
#define mscale (pow(2 . , 60 . 0) / (1 . 0/1024. 0) ) 
#define escale (xscale*xscale) 
#define fscale (-xscale*xscale/mscale) 
#define NPOS 32 
#define NLOG 17 
#deflne NMAN 8 
#deflne NFOR 57 
#deflne NACC 64 

/NVMP 2; 
/NPIPE 2; 

/JPSET xj [3] ,x[] [] ,ufix,NPOS, xscale; 
/JPSET mj,m[] , log , NLOG , NMAN , mscale 
/IPSET xi[3] ,x[] [] ,ufix, NPOS, xscale; 
/IPSET ieps2 , eps2 , log, NLOG , NMAN , escale ; 
/FOSET sx[3] ,a[] □ , f ix , NACC , f scale ; 

pg_f lx_addsub(SUB,xi,xj ,xij , NPOS, 1) ; 
pg_conv_ftol(xij ,dx, NPOS, NLOG, NMAN, 4) ; 
pg_log_shift(l,dx,x2,NL0G) ; 

pg_log_unsigned_add(x2[0] ,x2[l] ,x2y2,NL0G,NMAN,3) ; 
pg_log_unsigned_add (x2 [2] , ieps2 , z2e2 , NLOG , NMAN , 3) ; 
pg_log_unsigned_add (x2y2 , z2e2 , r2 , NLOG , NMAN , 3) ; 
pg_log_shift(-l,r2,rl,NL0G) ; 
pg_log_muldi v (MUL , r2 , r 1 , r3 , NLOG , 1 ) ; 
pg_log_muldi v (DIV , mj , r 3 , mf , NLOG , 1 ) ; 
pg_log_muldi v (MUL , mf , dx , f x , NLOG , 1 ) ; 
pg_conv_ltof (fx,f fx, NLOG, NMAN, NFOR, 2) ; 
pg_f ix_accum(f f x , sx , NFOR, NACC , 2) ; 



Fig. 8. A sample design entry file for the gravitational force pipeline written in PGDL 
program has the following form: 

void force(double x[] [3] , double m[] , double eps2, double a[] [3] , int n) ; 

In this example, positions of i particles and j particles are passed as a single array x, since 
the same name is used in /IPSET and /JPSET. Thus, with this interface it is only possible to 
calculate the self-gravity of an iV-body system. 

4-2. Performance of Generated Pipeline 

Here we report the performance of the PGDL-generated gravitational force calculation 
pipeline, and compare that with that of GRAPE-3 and GRAPE-5. We summarize internal 
number expressions for the pipeline designs in table 2. 

Table 3 shows the size and performance of the generated pipeline (model G5) for several 
implementations with different number of pipeline stages, for two different kinds of the Altera 
device. The pipeline designs with different number of pipeline stages can be easily obtained by 
a small modification of the design entry file in PGPG. The size and maximum operation speeds 
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Table 2. Model 



Model 


Position 


Internal(mantissa) 


Accumulation 


G3 


20bit fixed 


14(5)bit log 


56bit fixed 


G5 


32bit fixed 


17(8)bit log 


64bit fixed 


G5+ 


32bit fixed 


20(ll)bit log 


64bit fixed 



Table 3. Performance of the generated pipeline (model G5) 





APEX20k 






Stratix 




stage 


size(LE) 


/ max (MHz) 


stage 


size(LE) 


/ max (MHz) 


14 


2735 


58.92 


17 


2499 


133.30 


16 


2928 


60.97 


19 


2655 


137.51 


17 


2925 


73.78 


21 


2849 


142.29 


20 


3074 


74.65 


23 


2927 


135.78 


21 


3064 


80.33 


24 


2864 


142.88 



are those reported by Altera's design software, Quartus II(ver 3.0). The speed grade of these 
devices are (-2) for APEX20k and (-5) for Stratix (fastest available at the time of the writing). 

Table 4 shows the size and performance of pipelines with different accuracy (G3, G5 
and G5+). One can see that both the performance penalty and size increase due to increased 
accuracy of G5+, compared to G3 or G5, are fairly modest. 

With currently available FPGA (Stratix EP1S20), we can fit 5 G5 pipelines running 
at 180 MHz into one chip. The original GRAPE-5 pipeline chip, which was made 7 years 
ago, had two pipelines operating at 80 MHz clock. Thus, FPGA implementation of GRAPE-5 
has about 5 times more speed than the original custom-chip implementation. The peak speed 
of one FPGA chip for GRAPE-5 pipeline is 34.2 Gflops. Of course, this large improvement 
over GRAPE-5 is due primarily to the advance in the semiconductor technology in the 7 years 
(from 0.5/im to 130 nm), but clearly indicates that FPGA-based computing engine does offer 
very good performance, and that PGDL provides a practical tool to implement special-purpose 
computers on FPGA-based computing engines. 



Table 4. Performance of generated pipelines (Stratix) 



Model 


/max 


Size 


Memory 


Stage 




(MHz) 


(LE) 


(bit) 




G5 


181.82 


3021 


41k 


30 


G3 


191.64 


2369 


21k 


26 


G5+ 


142.27 


5082 


402k 


35 
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5. Discussion 



5.1. Comparison with Other Design Methodology 

In other area, such as digital signal processing, there exist many code generators that 
generates HDL code from a simple description. For example, a commercial package (MATLAB) 
generates HDL code for fixed-point filter designed using itself. 

Recently, a design methodology called the system-level design has become popular. In 
the system-level design, the function and architecture of LSI or FPGA are described using 
C/C++ languages or subset of them. These languages are called the System-Level Description 
Language (SLDL). Using SLDL, programmers can verify functionality and performance at the 
early stage of the development. The design is divided into software part and hardware part. 
The hardware part will be synthesized to register-transfer design by a SLDL design software. 
The SpecC, System-C, or Handel-C are the well-known SLDLs commercially available. There 
are also a number of research projects to design hardware using C++/Java languages (e.g. 
Hutchings 1999, Mencer 2002, Tsoi 2004). 

The goal of these SLDL is to describe hardware logic without using hardware description 
language, such as VHDL or Verilog HDL. Therefore, even if we use SLDL, we still need a detailed 
description of the hardware in other language like C or C++. If we consider in the traditional 
flow (Figure 2), we can save step (C) and (D) using SLDL, while step (B) is still required. 
Using PGPG, we can replace all of steps by writing a short high-level hardware description. 

5.2. Planned and Ongoing Improvement of PGPG and PGDL 

In this paper, we described basic concept and functions of PGPG. For those who 
are more interested, we put a CGI program of the current version of PGPG at a web site 
(http://progrape.jp). The CGI program generates VHDL code, user interface code, and 
emulator code from a PGDL description. 

Although the current version of PGPG is successful in designing the pipeline for the 
gravitational force, its functionality is rather limited. We are currently developing the next 
version of PGPG that supports more functionality and multiple hardwares. For the next 
version, we plan to add the modules needed to design a pipeline for the SPH simulation and 
Boundary Element Method (BEM). BEM is one of methods to solve numerically the boundary 
value problem of partial differential equation (Brebbia 1978). The floating-point format with 
longer mantissa is needed for these applications. We are now further developing the support of 
the floating-point arithmetic module for PGPG. We also plan to support Xilinx FPGA chips 
as well as Altera chips. As the target board for Xilinx device, we use Bioler3/HORN-5 board, 
developed by Chiba University and RIKEN (Ito et al 2004). 

This research was partially supported by the Grants-in-Aid by the Japan Society for 
the Promotion of Science (14740127) and by the Ministry of Education, Science, Sports, and 
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Table 5. PGPG version 1.0 feature 
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fixed point format accumulator 
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unsigned logarithmic format adder 
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Definition 




number of pipeline 




/1WMP 
/ l\l V PIT 


IlLLIIlL.Jd Ul VII L Hell lllLllLipiC pipCllIlL. 




/ Jr orj 1 


I11(_-II1(JI V 11I11L oL L l nift 




/IPSET 
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/FOSET 


output register setting 


Device support 




Altera's FPGA 


Hardware support 




PROGRAPE-2 


Other 




Options for look-up table 



Culture of Japan (16684002). 

Appendix 1. Description of PGDL Declarations and Available Modules 

Table 5 shows the features of the current version of PGPG. The specification of version 
1.0 is determined so that a pipeline for gravitational force, shown in section 4, can be constructed 
as the first step. 

PGPG version 1.0 supports nine parametrized modules as shown in Table 5. The 
bit length and the number of pipeline stage for each module can be changed by the 
arguments. For example, the arguments of the fixed point format adder /subtracter 
pg_f ix_addsub(SUB ,xi ,xj ,xij ,32, 1) indicate an operation flag(adder or subtracter), the first 
input, the second input, output, bit length, and number of pipeline stages, respectively, from 
the first to sixth argument. 

Modules pg_f ix_addsub and pg_f ix_accum are fixed point format adder/subtracter and 
sign-magnitude accumulator, respectively. Modules pg_log_muldiv and pg_log_unsigned_add 
are logarithmic format multiplier /divider and unsigned adder, respectively. In the logarithmic 
format, a positive, non-zero real number x is represented by its base-2 logarithm y as x = 2 y . 
The logarithmic format has been adapted for the gravitational pipeline because it has larger 
dynamics length for the same word length and operation such as multiplication and square 
root are easier to implement than in the usual floating-point format. For more details of the 
logarithmic format, see GRAPE-5 paper (Kawai et al. 2000). Module pg_log_shift is a 
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logarithmic format shifter. Shift operations in the logarithmic format express square (left shift) 
and squared root (right shift). Module pg_conv_f tol is a converter from the fixed point format 
to the logarithmic format, and pg_conv_ltof is a converter from the logarithmic format to the 
fixed point format. In PGPG version 1.0, these modules are described partly using the Altera's 
LPM. A gap of delay timing is synchronized automatically by the PGPG. 

Addition to the parametrized modules, five definitions are defined in PGPG version 
1.0. Definitions /NPIPE and /NVMP define the numbers of (real) pipeline and virtual multiple 
pipeline (Makino et al. 1997), respectively. Definition /JPSET defines a setting for the memory 
unit. Definitions /IPSET and /FOSET define settings for the input and output registers in the 
interaction pipeline, respectively. 

Appendix 2. Details of The Generated Gravitational Force Pipeline 

In this appendix, we show a part of the code generated by PGPG from PGDL description 
of the force calculation pipeline. More complete code is obtained by a CGI program of the 
current version of PGPG in a website (http://progrape.jp). 

A. 2.1. VHDL Code 

PGPG generates description files of the designed hardware logic in VHDL. The hardware 
logic includes the pipeline logic itself and its peripheral logic. Figures 9 and 10 show a part 
of the VHDL source files generated by PGPG (the total length is about 2800 lines). The 
design software provided by the FPGA manufacture creates configuration data of FPGA from 
the generated sources [Step (C) in figure 3]. The configuration data are downloaded into the 
programmable GRAPE hardware using the interface program also generated by PGPG. 

A. 2. 2. Interface Functions 

The interface software on the host computer to the programmable GRAPE hardware is 
composed by the C compiler of the sources generated by PGPG [Step(D') in figure 3]. Figure 
11 shows a part of source files in C generated by PGPG (the total length is about 150 lines). 
We run the application program linked with the interface software [Step(E)]. 

A. 2.3. Emulator Code 

Figures 12 and 13 show a part of the source files generated by PGPG (the total length 
is 580 lines). 
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library ieee; 

use ieee . std_logic_1164 . all ; 
use ieee . std_logic_unsigned. all ; 

entity pipe is 

generic (JDATA_WIDTH : integer :=72) ; 
port(p_jdata : in std_logic_vector ( JDATA_WIDTH-1 downto 0) ; 

p_run : in std_logic; 

p_we : in std_logic; 

p_adri : in std_logic_vector (3 downto 0); 
p_adrivp : in std_logic_vector (3 downto 0); 
p_datai : in std_logic_vector (31 downto 0); 
p_adro : in std_logic_vector (3 downto 0); 
p_adrovp : in std_logic_vector (3 downto 0); 
p_datao : out std_logic_vector (31 downto 0); 
p_runret : out std_logic; 
rst.pclk : in std_logic) ; 
end pipe; 

architecture std of pipe is 
begin 

process (pclk) begin 

if (pclk' event and pclk='l') then 
jdatal <= p_jdata; 

end if ; 
end process; 

process (pclk) begin 

if (pclk' event and pclk='l') then 
if (vmp_phase = "0000") then 

xj(31 downto 0) <= jdatal (31 downto 0); 
yj(31 downto 0) <= jdatal (63 downto 32); 
zj (31 downto 0) <= p_jdata(31 downto 0); 
mj(16 downto 0) <= p_jdata(48 downto 32); 
end if ; 
end if ; 
end process; 



uO: 


Pg- 


f ix_sub_32_l 


port map (x=>xi ,y=>xj ,z=>xij , clk=>pclk) ; 


ill: 


Pg- 


.f ix_sub_32_l 


port map (x=>yi ,y=>yj ,z=>yij , clk=>pclk) ; 


u2: 


Pg- 


.f ix_sub_32_l 


port map (x=>zi , y=>zj , z=>zij , clk=>pclk) ; 


u3: 


Pg- 


conv_ftol_32_ 


.17_8_4 port map (f ixdata=>xij , logdata=>dx, clk=>pclk) 


u4: 


Pg- 


conv_ftol_32_ 


.17_8_4 port map (f ixdata=>yij , logdata=>dy , clk=>pclk) 


u5: 


Pg- 


conv_ftol_32_ 


_17_8_4 port map (f ixdata=>zij , logdata=>dz, clk=>pclk) 



file shown in Figure 8. Component and signal declaration sentences are omitted. 



end std; 



Fig. 9. A part of the source files in VHDL 




library ieee; 

use ieee . std_logic_1164 . all ; 

entity pg_conv_f tol_32_17_8_4 is 

port (f ixdata : in std_logic_vector (31 downto 0); 

logdata : out std_logic_vector (16 downto 0); 

elk : in std_logic) ; 
end pg_conv_ftol_32_17_8_4; 

architecture rtl of pg_conv_f tol_32_17_8_4 is 
begin 

dl <= NOT f ixdata(30 downto 0) ; 

one <= "0000000000000000000000000000001"; 

ul: lpm_add_sub generic map (LPM_WIDTH=>31,LPM_DIRECTI0N=>"ADD") 

port map(result=>d2,dataa=>dl,datab=>one) ; 
dO <= f ixdata(30 downto 0) ; 
signO <= f ixdata(31) ; 

with signO select 
d3 <= dO when ' ' , 
d2 when others ; 

process (elk) begin 

if (elk' event and clk='l') then 
d3r <= d3; 
signl <= signO; 
end if ; 
end process; 

u2: penc_31_5 port map (a=>d3r , c=>cl) ; 
with d3r select 

nzO <= '0' when "0000000000000000000000000000000", 
' 1 ' when others ; 



end rtl; 

Fig. 10. A part of the source file in VHDL (part 2) for the pipeline logic generated from the design entry 
file shown in Figure 8 (part 2). Component and signal declaration sentences are omitted. 
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#include <stdio.h> 
#include <math.h> 

void force(double x[][3], double m[] , double eps2, double a[][3], int n) 
{ 

npipe = 4; 
pgpgi_initial() ; 
pgpgi_setxj (n,x,m) ; 

f or(i=0 ; i<n; i+=npipe) { 
if ( (i+npipe) >n){ 

nn = n - i ; 
}else{ 

nn = npipe; 

> 

pgpgi_setxi(i,nn,x,eps2) ; 
pgpgi_run(n) ; 
pgpgi_getf orce(i,nn,a) ; 

} 

} 

void pgpgi_setxj (int n, double x[][3], double m[]) 
{ 

devid = 0; 
for(j=0;j<n;j++){ 

xj = ((unsigned int) (x[j][0] * (pow(2 . ,32 . 0) / (64 . 0) ) + 0.5)) k Oxffffffff; 
yj = ((unsigned int) (x[j][l] * (pow(2 . ,32 . 0) / (64 . 0) ) + 0.5)) k Oxffffffff; 
zj = ((unsigned int) (x[j][2] * (pow(2 . ,32 . 0) / (64 . 0) ) + 0.5)) k Oxffffffff; 
if(m[j] == 0.0H 

mj = 0; 
}else if(m[j] > 0.0){ 

mj = (((int)(pow(2.0,8.0)*log(m[j]*(pow(2.0,60.0)/(1.0/1024.0)))/log(2.0))) k 0x7fff) I 0x8000; 
}else{ 

mj = (((int)(pow(2.0,8.0)*log(-m[j]*(pow(2.0,60.0)/(1.0/1024.0)))/log(2.0))) k 0x7fff) I 0x18000; 

} 

nword = 4; 

jpdata[0] = OxffcOO; 
jpdata[l] = 2*j+l ; 

jpdata[2] = 0x0 I ((Oxffffffff k xj) « 0) ; 
jpdata[3] = 0x0 I ((Oxffffffff k yj) « 0) ; 
g6_set _jpdat a (devid, nword, jpdata) ; 

jpdata[l] = 2*j+0; 

jpdata[2] = 0x0 I ((Oxffffffff k zj) « 0) ; 
jpdata[3] = 0x0 I ((Oxlffff k mj) « 0) ; 
g6_ set _ jpdata (devid, nword, jpdata) ; 

} 

} 



Fig. 11. A part of source file in C for the interface program generated from the design entry file shown 
in Figure 8. Variable declaration sentences are omitted. 
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#include <stdio.h> 
#include <math.h> 

void force(double x[][3], double m[] , double eps2, double a[][3], int n) 
{ 

for(i=0;i<n;i++){ 

xi = ((unsigned int) (x [i] [0] * (pow(2 . 0,32 .0)/64.0) + 0.5)) k Oxffffffff; 
yi = ((unsigned int) (x[i][l] * (pow(2 . 0,32 .0)/64. 0) + 0.5)) k Oxffffffff; 
zi = ((unsigned int) (x[i][2] * (pow(2 . 0,32 .0)/64. 0) + 0.5)) k Oxffffffff; 
if(eps2 == 0.0) { 

ieps2 = 0; 
}else if(eps2 > 0.0) { 

ieps2 = (((int) (pow(2. 0,8. 0) *log(eps2* ( (pow(2 . 0,32. 0) /64 . 0) * (pow(2 . 0,32. 0)/64 . 0) ) ) /log(2 . 0) ) ) & 0x7fff) I 0x8000; 
}else{ 

ieps2 = (((int)(pow(2.0,8.0)*log(-eps2*((pow(2.0,32.0)/64.0)*(pow(2.0,32.0)/64.0)))/log(2.0))) k 0x7fff) I 0x18000; 

} 

sx = 0; 
sy = 0; 
sz = 0; 

for(j=0;j<n;j++){ 

xj = ((unsigned int) (x[j][0] * (pow(2 . ,32 . 0) /64 . 0) + 0.5)) k Oxffffffff; 
yj = ((unsigned int) (x[j][l] * (pow(2 . ,32 . 0) /64. 0) + 0.5)) k Oxffffffff; 
zj = ((unsigned int) (x[j][2] * (pow(2 . ,32 .0)/64.0) + 0.5)) k Oxffffffff; 
if(m[j] == 0.0H 

mj = 0; 
}else if(m[j] > 0.0){ 

mj = (((int)(pow(2.0,8.0)*log(m[j]*(pow(2.0,60.0)/(1.0/1024.0)))/log(2.0))) k 0x7fff) I 0x8000; 
}else{ 

mj = (((int) (pow(2. 0,8. 0) *log(-m [j] * (pow(2 . 0,60. 0)/(l. 0/1024. 0) ) ) /log(2 . 0) ) ) k 0x7fff) I 0x18000; 

} 

pg_f ix_sub_32(xi ,xj ,&xij ) ; 
pg_fix_sub_32(yi,yj ,&yij) ; 
pg_f ix_sub_32(zi ,zj ,&zij ) ; 
pg_conv_ftol_f ix32_logl7_man8(xij , &dx) ; 
pg_conv_ftol_f ix32_logl7_man8(yij ,&dy) ; 
pg_conv_f tol_f ix32_logl7_man8(zij ,&dz) ; 



pg_f ix_accum_f 57_s64(f f x,&sx) ; 
pg_f ix_accum_f 57_s64(f f y ,&sy) ; 
pg_f ix_accum_f 57_s64(f f z,&sz) ; 

} 

a[i] [0] = ((double) (sx«0) ) * (- (pow(2 . , 32 . 0) /64. 0) * (pow(2 . 0, 32 . 0)/64 . 0)/ (pow(2 . , 60 . 0) / ( 1 . 0/1024 . 0) ) ) /pow(2 . , . 0) 
a[i] [1] = ((double) (sy«0) ) * (- (pow(2 . , 32 . 0) /64. 0) * (pow(2 . 0, 32 . 0)/64 . 0)/ (pow(2 . , 60 . 0) / (1 . 0/1024 . 0) ) ) /pow(2 . , . 0) 
a[i] [2] = ((double) (sz«0) ) * (- (pow(2 . , 32 . 0) /64 . 0) * (pow(2 . 0, 32 . 0)/64 . 0)/ (pow(2 . , 60 . 0) / (1 . 0/1024 . 0) ) ) /pow(2 . , . 0) 

} 

} 

Fig. 12. The source file (for top architecture) in C for bit-level emulator generated from the design entry 
file shown in Figure 8. Variable declaration sentences are omitted. 
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#include<stdio . h> 
#include<math . h> 

void pg_conv_f tol_f ix32_logl7_man8(int fixdata, int* logdataH 
/* SIGN BIT */ 

fixdata_msb = Oxl&(fixdata »31); 
logdata_sign = f ixdata_msb; 

/* ABSOLUTE */ 

fixdata_body = 0x7FFFFFFF & fixdata; 
{ 

int inv_f ixdata_body=0; 

inv_f ixdata_body = 0x7FFFFFFF ~ f ixdata_body ; 
if (f ixdata_msb == Oxl){ 

abs = 0x7FFFFFFF & (inv_f ixdata_body + 1) ; 
}else{ 

abs = f ixdata_body ; 

} 

} 

abs_decimal = 0x3FFFFFFF& abs; 

/* GENERATE NON-ZERO BIT (ALL BIT OR) */ 

if (abs != OxO) { logdata_nonzero = Oxl; }else{ logdata_nonzero=OxO; } 

{ /* PRIORITY ENCODER */ 
int i ; 

int count=0 ; 
for(i=31;i >=0;i— ){ 
int buf ; 

buf = Oxl & (abs »i) ; 

if (buf == Oxl) { count = i; break;} 

count = i ; 

} 

penc_ out = count ; 

> 

penc_out = OxlF & penc_out ; /* 5-bit */ 



Fig. 13. A part of the source file (for modules) in C for the bit-level emulator generated from the design 
entry file shown in Figure 8. Variable declaration sentences arc omitted. 
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